Abstract. One of the most important problems in large scale inference is the
identification of variables that are highly dependent on several other variables. When
dependency is measured by partial correlations these variables identify those rows of the partial
correlation matrix that have several entries with magnitude close to one; i.e., hubs in the
associated partial correlation graph. This paper develops theory and algorithms for
discovering such hubs from a few observations of these variables. We introduce a hub screening
framework in which the user specifies both a minimum (partial) correlation $\rho$ and a minimum
degree $\delta$ to screen the vertices. The choice of $\rho$ and $\delta$ can be guided by our mathematical
expressions for the phase transition correlation threshold $\rho_c$ governing the average number
of discoveries. We also give asymptotic expressions for familywise discovery rates under
the assumption of large $p$, fixed number $n$ of multivariate samples, and weak dependence.
Under the null hypothesis that the covariance matrix is sparse these limiting expressions can
be used to enforce FWER constraints and to rank these discoveries in order of statistical
significance (p-value). For $n \ll p$ the computational complexity of the proposed partial
correlation screening method is low and the method is therefore highly scalable. Thus it can be
applied to significantly larger problems than previous approaches. The theory is applied to
discovering hubs in a high dimensional gene microarray dataset.

1. Introduction

This paper treats the problem of screening a p-variate sample for strongly and multiply
connected vertices in the partial correlation graph associated with the partial correlation
matrix of the sample. This problem, called hub screening, is important in many applications
ranging from network security to computational biology to finance to social networks. In the 
area of network security, a node that becomes a hub of high correlation with neighboring 
nodes might signal anomalous activity such as a coordinated flooding attack. In the area 
of computational biology the set of hubs of a gene expression correlation graph can serve 
as potential targets for drug treatment to block a pathway or modulate host response. In 
the area of finance a hub might indicate a vulnerable financial instrument or sector whose 
collapse might have major repercussions on the market. In the area of social networks a 
hub of observed interactions between criminal suspects could be an influential ringleader. 
The techniques and theory presented in this paper permit scalable and reliable screening for
such hubs. Unlike the correlation screening problem studied in [12], this paper considers the
more challenging problem of partial correlation screening for variables with a high degree of
connectivity. In particular we consider 1) extension to the more difficult problem of screening
for partial correlations exceeding a specified magnitude; 2) extension to screening variables
whose vertex degree in the associated partial correlation graph exceeds a specified degree.

The hub screening theory presented in this paper can be applied to structure discovery 
in p-dimensional Gaussian graphical models, a topic of recent interest to statisticians, com- 
puter scientists and engineers working in areas such as gene expression analysis, information 
theoretic imaging and sensor networks [7], [15], [21]. For example, the authors of [1] propose a
Euclidean nearest neighbor graph method for testing independence in a Gauss-Markov ran- 
dom field model. When specialized to the null hypothesis of spatially independent Gaussian 
measurements, our results characterize the large p phase transitions and specify a weak 
(Poisson-like) limit law on the number of highly connected nodes in such nearest neighbor 
graphs for finite number n of observations. 

Many different methods for inferring properties of correlation and partial correlation ma- 
trices have been recently proposed [8], [18], [19], [3], [16]. Several of these methods have been 
contrasted and compared in bioinformatics applications [9], [14], [17] similar to the ones we 
consider in Sec. 5. The above papers address the covariance selection problem [6]: to find 
the non-zero entries in the covariance or inverse covariance matrix or, equivalently, to find 
the edges in the associated correlation or partial correlation graph. 

The problem treated and the solution proposed in this paper differ from those of these 
previous papers in several important ways: 1) as contrasted to [6] our objective is to screen 
for vertices in the graph instead of to screen for edges; 2) unlike [3] our objective is to 
directly control false positives instead of maximizing a likelihood function or minimizing a 
matrix norm approximation error; 3) our framework is specifically adapted to the case of a 
finite number of samples and a large number of variables (n < p); 4) our asymptotic theory 
provides mathematical expressions for the p-value for each of the variables with respect to 
a sparse null hypothesis on the covariance; 5) unlike lasso type methods like [16] the hub 
screening implementation can be directly applied to very large numbers of variables without 
the need for initial variable reduction. Additional relevant literature on correlation based 
methods can be found in [12]. 

For specified $\rho \in [0, 1]$ and $\delta \in \{1, \ldots, p-1\}$, a hub is defined broadly as any variable that
is correlated with at least $\delta$ other variables with magnitude correlation exceeding $\rho$. Hub
screening is accomplished by thresholding the sample correlation matrix or partial correlation
matrix and searching for rows with more than $\delta$ non-zero entries. We call the former
correlation hub screening and the latter partial correlation hub screening. The screening is
performed in a computationally efficient manner by exploiting the equivalence between
correlation graphs and ball graphs on the set of Z-scores. Specifically, assume that $n$ samples
of $p$ variables are available in the form of a data matrix where $n < p$. First the columns of
the data matrix are converted to standard $n$-variate Z-scores. The set of $p$ Z-scores uniquely
determines the sample correlation matrix. If partial correlations are of interest, these Z-scores
are replaced by equivalent modified Z-scores that characterize the sample partial correlation
matrix, defined as the Moore-Penrose pseudo-inverse of the sample correlation matrix. Then
an approximate k-nearest neighbor (ANN) algorithm is applied to the Z-scores or the modified
Z-scores to construct a ball graph associated with the given threshold $\rho$. Hub variables
are discovered by scanning the graph for those whose vertex degree exceeds $\delta$. The ANN
approach only computes a small number of the sample correlations or partial correlations,
circumventing the difficult (or impossible) task of computing all entries of the correlation
matrix when there are millions (or billions) of variables. State-of-the-art ANN software has
been demonstrated on over a billion variables [13] and thus our proposed hub screening
procedure has potential application to problems of massive scale. We also note that the
standard Moore-Penrose inverse is well understood to be a sub-optimal estimator of the
partial correlation matrix in terms of minimum mean square error [10]. To our knowledge
its properties for screening for partial correlations have yet to be investigated. This paper
demonstrates its potential use in detecting high partial correlations, as opposed to
estimating them.
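
The equivalence between correlation graphs and ball graphs invoked above rests on an elementary identity for unit vectors. The following short derivation is a sketch under the assumption that each Z-score $\mathbf{Z}_i$ has unit Euclidean norm, so that the sample correlation is $r_{ij} = \mathbf{Z}_i^T \mathbf{Z}_j$ (as in the representation (3) of Sec. 2.2); the symbol $r_{ij}$ is introduced here only for illustration.

```latex
\[
\|\mathbf{Z}_i - \mathbf{Z}_j\|^2 = 2 - 2 r_{ij},
\qquad
\|\mathbf{Z}_i + \mathbf{Z}_j\|^2 = 2 + 2 r_{ij},
\]
\[
|r_{ij}| \geq \rho
\iff
\min\left\{ \|\mathbf{Z}_i - \mathbf{Z}_j\|,\ \|\mathbf{Z}_i + \mathbf{Z}_j\| \right\}
\leq \sqrt{2(1-\rho)} .
\]
```

Under this reading, thresholding the correlation magnitude at $\rho$ amounts to asking whether $\mathbf{Z}_j$ or its antipode $-\mathbf{Z}_j$ falls in a Euclidean ball of radius $\sqrt{2(1-\rho)}$ around $\mathbf{Z}_i$, which is the kind of query a nearest neighbor search can answer without forming the full correlation matrix.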

No screening procedure would be complete without error control. We establish limiting
expressions for mean hub discovery rates. These expressions are used to obtain an approximate
phase transition threshold $\rho_c$ below which the average number of hub discoveries abruptly
increases. When the screening threshold $\rho$ is below $\rho_c$ the discoveries are likely to be
dominated by false positives. We then show that the number of discoveries is dominated by a
random variable that converges to a Poisson limit as $\rho$ approaches 1 and $p$ goes to infinity.
Thus the probability of making at least one hub discovery converges to a Poisson cumulative
distribution. In the case of independent identically distributed (i.i.d.) elliptically distributed
samples and sparse block diagonal covariance matrix, the mean of the dominating variable
does not depend on the unknown correlations. In this case we can specify asymptotic p-values
on hub discoveries of given degree under a sparse-covariance null model. Finite $p$ bounds on
the Poisson p-value approximation error are given that decrease at rates determined by $p$,
$\delta$, $\rho$, and the sparsity factor of the covariance matrix.

To illustrate the power of the hub screening methods we apply them to a public gene
expression dataset: the NKI breast cancer data [4]. This dataset contains over
twenty thousand variables (genes) but many fewer observations (GeneChips). The screening
reveals interesting and previously unreported dependency structure among the variables. For 
the purposes of exploring neighborhood structure of the discoveries we introduce a waterfall 
plot of their approximate p-values that plots the family of degree-indexed p-value curves 
over the range of correlation thresholds. This graphic rendering can provide insight into the 
structure and significance of the correlation neighborhoods as we sweep the variables over 
different vertex degree curves in the waterfall plot. 

The outline of this paper is as follows. In Sec. 2 we formally define the hub screening 
problem. In Sec. 2.2 we present the Z-score representation for the pseudo-inverse of the 
sample correlation matrix. In Sec. 3 we give an overview of the results pertaining to phase 
transition thresholds and establish general limit theorems for the familywise discovery rates 
and p-values. Section 4 gives the formal statements of the results in the paper. The proofs
of these results are given in the appendix. In Sec. 5 we validate the theoretical predictions
by simulation and illustrate the application of hub screening to gene microarray data.

2. Hub screening framework 

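Let the p-variate $\mathbf{X} = [X_1, \ldots, X_p]^T$ have mean $\boldsymbol{\mu}$ and non-singular $p \times p$ covariance
matrix $\boldsymbol{\Sigma}$. We will often assume that $\mathbf{X}$ has an elliptically contoured density:
$f_{\mathbf{X}}(\mathbf{x}) = g\left((\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$ for some non-negative strictly decreasing function $g$ on $\mathbb{R}^+$. The
correlation matrix and the partial correlation matrix are defined as
$\boldsymbol{\Gamma} = \mathbf{D}_{\boldsymbol{\Sigma}}^{-1/2} \boldsymbol{\Sigma} \mathbf{D}_{\boldsymbol{\Sigma}}^{-1/2}$ and
$\boldsymbol{\Omega} = \mathbf{D}_{\boldsymbol{\Sigma}^{-1}}^{-1/2} \boldsymbol{\Sigma}^{-1} \mathbf{D}_{\boldsymbol{\Sigma}^{-1}}^{-1/2}$, respectively, where for a square matrix $\mathbf{A}$, $\mathbf{D}_{\mathbf{A}} = \mathrm{diag}(\mathbf{A})$ denotes the
diagonal matrix obtained from $\mathbf{A}$ by zeroing out all entries not on its diagonal.

Available for observation is an $n \times p$ data matrix $\mathbb{X}$ whose rows are (possibly dependent)
replicates of $\mathbf{X}$:

$$\mathbb{X} = [\mathbf{X}_1, \ldots, \mathbf{X}_p] = [\mathbf{X}_{(1)}^T, \ldots, \mathbf{X}_{(n)}^T]^T,$$

where $\mathbf{X}_i = [X_{1i}, \ldots, X_{ni}]^T$ and $\mathbf{X}_{(i)} = [X_{i1}, \ldots, X_{ip}]$ denote the i-th column and row,
respectively, of $\mathbb{X}$. Define the sample mean of the i-th column $\bar{X}_i = n^{-1} \sum_{j=1}^{n} X_{ji}$, the vector
of sample means $\bar{\mathbf{X}} = [\bar{X}_1, \ldots, \bar{X}_p]$, the $p \times p$ sample covariance matrix
$\mathbf{S} = \frac{1}{n-1} \sum_{i=1}^{n} (\mathbf{X}_{(i)} - \bar{\mathbf{X}})^T (\mathbf{X}_{(i)} - \bar{\mathbf{X}})$, and the $p \times p$ sample correlation matrix

$$\mathbf{R} = \mathbf{D}_{\mathbf{S}}^{-1/2} \mathbf{S} \mathbf{D}_{\mathbf{S}}^{-1/2}.$$

For a full rank sample correlation matrix $\mathbf{R}$ the sample partial correlation matrix is defined
as

$$\mathbf{P} = \mathbf{D}_{\mathbf{R}^{-1}}^{-1/2} \mathbf{R}^{-1} \mathbf{D}_{\mathbf{R}^{-1}}^{-1/2}.$$

In the case that $\mathbf{R}$ is not full rank this definition must be modified. Several methods have
been proposed for regularizing the inverse of a rank deficient covariance including shrinkage
and pseudo-inverse approaches [20]. In this paper we adopt the pseudo-inverse approach and
define the sample partial correlation matrix as

$$\mathbf{P} = \mathbf{D}_{\mathbf{R}^{\dagger}}^{-1/2} \mathbf{R}^{\dagger} \mathbf{D}_{\mathbf{R}^{\dagger}}^{-1/2},$$

where $\mathbf{R}^{\dagger}$ denotes the Moore-Penrose pseudo-inverse of $\mathbf{R}$.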

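Let the non-negative definite symmetric matrix $\boldsymbol{\Phi} = ((\Phi_{ij}))_{i,j=1}^{p}$ be generic notation for
a correlation-type matrix like $\boldsymbol{\Gamma}$, $\boldsymbol{\Omega}$, $\mathbf{R}$, or $\mathbf{P}$. For a threshold $\rho \in [0, 1]$ define $\mathcal{G}_\rho(\boldsymbol{\Phi})$ the
undirected graph induced by thresholding $\boldsymbol{\Phi}$ as follows. The graph $\mathcal{G}_\rho(\boldsymbol{\Phi})$ has vertex set
$\mathcal{V} = \{1, \ldots, p\}$ and edge set $\mathcal{E} = \{e_{ij}\}_{i,j \in \{1,\ldots,p\}: i<j}$, where an edge $e_{ij} \in \mathcal{E}$ exists in $\mathcal{G}_\rho(\boldsymbol{\Phi})$ if
$|\Phi_{ij}| \geq \rho$. The degree of the i-th vertex of $\mathcal{G}_\rho(\boldsymbol{\Phi})$ is $|\{j \neq i : |\Phi_{ij}| \geq \rho\}|$, the number of edges
that connect to $i$. We call $\mathcal{G}_0(\boldsymbol{\Phi})$ the population correlation or partial correlation graph
[5] depending on whether $\boldsymbol{\Phi}$ is defined as $\boldsymbol{\Gamma}$ or $\boldsymbol{\Omega}$. Likewise we call $\mathcal{G}_\rho(\boldsymbol{\Phi})$ the empirical
correlation or partial correlation graph depending on whether $\boldsymbol{\Phi} = \mathbf{R}$ or $\boldsymbol{\Phi} = \mathbf{P}$.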

A $p \times p$ matrix is said to be row-sparse of degree $k$, called the sparsity degree, if no row
contains more than $k+1$ non-zero entries. When $\boldsymbol{\Phi}$ is row-sparse of degree $k$ its graph $\mathcal{G}_\rho(\boldsymbol{\Phi})$
has no vertex of degree greater than $k$. A special case is a block-sparse matrix of degree $k$:
a matrix that can be reduced via row-column permutation to block diagonal form with a
single $k \times k$ block.

2.1. The hub screening problem. Assume that a single treatment of the $p$ variables yields
data matrix $\mathbb{X}$ of dimension $n \times p$. A given vertex $i$ is declared a hub screening discovery at
degree level $\delta$ and threshold level $\rho$ if

(1) $$|\{j : j \neq i, |\Phi_{ij}| \geq \rho\}| \geq \delta,$$

where $\boldsymbol{\Phi}$ is equal to $\mathbf{R}$ for correlation hub screening or is equal to $\mathbf{P}$ for partial correlation
hub screening. We denote by $N_{\delta,\rho} \in \{0, \ldots, p\}$ the total number of hub screening discoveries.

For the procedure to be practically useful, guidelines are needed for selecting the screening
parameters $\rho$ and $\delta$ in (1). In the sequel we will develop a large $p$ asymptotic analysis to
address the following issues: 1) phase transitions in the number of discoveries as a function
of these screening parameters; 2) relations between the false positive rate of the discoveries
and these screening parameters.
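
To make criterion (1) concrete, the following is a minimal brute-force sketch in Python with NumPy. It is not the ANN-based procedure described in the introduction: it forms the full $p \times p$ matrix and is therefore only practical for small $p$. The function name `hub_screen`, the tolerance `rcond=1e-10`, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def hub_screen(X, rho, delta, partial=False):
    """Return (hub indices, vertex degrees) for criterion (1).

    X is an n x p data matrix with samples in rows. Brute force: the full
    p x p matrix Phi is formed, so this is only meant for small p.
    """
    R = np.corrcoef(X, rowvar=False)            # p x p sample correlation matrix
    if partial:
        Rp = np.linalg.pinv(R, rcond=1e-10)     # Moore-Penrose pseudo-inverse of R
        d = 1.0 / np.sqrt(np.diag(Rp))
        Phi = d[:, None] * Rp * d[None, :]      # rescale so the diagonal equals one
    else:
        Phi = R
    A = np.abs(Phi) >= rho                      # thresholded adjacency matrix
    np.fill_diagonal(A, False)                  # exclude j == i
    degree = A.sum(axis=1)
    return np.flatnonzero(degree >= delta), degree

# Synthetic example (illustrative only): variables 1..3 are noisy copies of variable 0.
rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
X[:, 1:4] = X[:, [0]] + 0.1 * rng.standard_normal((n, 3))
hubs, degree = hub_screen(X, rho=0.9, delta=3)
print("hubs:", hubs, "degrees:", degree)
```

In this toy example the four tightly clustered variables are each connected to the other three at the chosen threshold, so all four are reported as correlation hubs of degree at least 3.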



2.2. Z-score representation. Define the $n \times p$ matrix of Z-scores associated with the data
matrix $\mathbb{X}$

(2) $$\mathbb{T} = [\mathbf{T}_1, \ldots, \mathbf{T}_p] = (n-1)^{-1/2} (\mathbf{I}_n - n^{-1} \mathbf{1}\mathbf{1}^T) \mathbb{X} \mathbf{D}_{\mathbf{S}}^{-1/2},$$

where $\mathbf{I}_n$ is the $n \times n$ identity matrix and $\mathbf{1} = [1, \ldots, 1]^T \in \mathbb{R}^n$. This Z-score matrix is to
be distinguished from the $(n-1) \times p$ Z-score matrices $\mathbb{U}$ and $\mathbb{Y}$, denoted collectively by the
notation $\mathbb{Z}$ in the sequel, that are derived from the matrix $\mathbb{T}$.

We exploit the following Z-score representation of the sample correlation matrix

(3) $$\mathbf{R} = \mathbb{T}^T \mathbb{T},$$

and define a set of equivalent but lower dimensional Z-scores called U-scores. The U-scores
lie in the unit sphere $S_{n-2}$ in $\mathbb{R}^{n-1}$ and are obtained by projecting away the rowspace
components of $\mathbb{T}$ in the direction of vector $\mathbf{1}$. More specifically, they are constructed as
follows.

Define the orthogonal $n \times n$ matrix $\mathbf{H} = [n^{-1/2}\mathbf{1}, \mathbf{H}_{2:n}]$. The matrix $\mathbf{H}_{2:n}$ can be obtained
by Gram-Schmidt orthogonalization and satisfies the properties

$$\mathbf{1}^T \mathbf{H} = [\sqrt{n}, 0, \ldots, 0], \qquad \mathbf{H}_{2:n}^T \mathbf{H}_{2:n} = \mathbf{I}_{n-1}.$$

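The U-score matrix $\mathbb{U} = [\mathbf{U}_1, \ldots, \mathbf{U}_p]$ is obtained from $\mathbb{T}$ by the following relation

(4) $$\begin{bmatrix} \mathbf{0}^T \\ \mathbb{U} \end{bmatrix} = \mathbf{H}^T \mathbb{T}.$$

The first row vanishes because the columns of $\mathbb{T}$ are centered, so that equivalently
$\mathbb{U} = \mathbf{H}_{2:n}^T \mathbb{T}$.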



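Lemma 1. Assume that $n < p$. The Moore-Penrose pseudo-inverse of $\mathbf{R}$ has the representation

(5) $$\mathbf{R}^{\dagger} = \mathbb{U}^T [\mathbb{U}\mathbb{U}^T]^{-2} \mathbb{U}.$$

The proof of the Lemma simply verifies that $\mathbf{Q} = \mathbb{U}^T [\mathbb{U}\mathbb{U}^T]^{-2} \mathbb{U}$ satisfies the Moore-Penrose
conditions for $\mathbf{Q}$ to be the unique pseudo-inverse of $\mathbf{R}$: 1) the matrices $\mathbf{Q}\mathbf{R}$ and
$\mathbf{R}\mathbf{Q}$ are symmetric; 2) $\mathbf{R}\mathbf{Q}\mathbf{R} = \mathbf{R}$; and 3) $\mathbf{Q}\mathbf{R}\mathbf{Q} = \mathbf{Q}$ [11].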

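Define the $(n-1) \times p$ matrix of partial correlation Z-scores $\mathbb{Y} = [\mathbf{Y}_1, \ldots, \mathbf{Y}_p]$, $\mathbf{Y}_i \in S_{n-2}$:

(6) $$\mathbb{Y} = [\mathbb{U}\mathbb{U}^T]^{-1} \mathbb{U} \mathbf{D}^{-1/2}_{\mathbb{U}^T [\mathbb{U}\mathbb{U}^T]^{-2} \mathbb{U}}.$$

With this definition Lemma 1 gives the following Z-score representation for the sample partial
correlation matrix

(7) $$\mathbf{P} = \mathbb{Y}^T \mathbb{Y}.$$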
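
The representations (3), (5) and (7) can be checked numerically on a small random example. The sketch below assumes NumPy and obtains the complement basis of the all-ones vector by a QR factorization, which is used here in place of the Gram-Schmidt construction mentioned in the text; the dimensions, the random data, and the variable names are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 20                                   # n < p, so R is rank deficient
X = rng.standard_normal((n, p))

# Z-scores (2): center each column and scale it to unit Euclidean norm.
Xc = X - X.mean(axis=0)
T = Xc / np.linalg.norm(Xc, axis=0)
R = T.T @ T                                    # representation (3)

# Orthonormal basis of the complement of the all-ones vector, via QR.
M = np.column_stack([np.ones(n), np.eye(n)[:, 1:]])
Q, _ = np.linalg.qr(M)                         # first column is proportional to the ones vector
H2 = Q[:, 1:]                                  # n x (n-1)
U = H2.T @ T                                   # U-scores, (n-1) x p

G_inv = np.linalg.inv(U @ U.T)                 # (n-1) x (n-1), invertible for generic data
R_pinv = U.T @ G_inv @ G_inv @ U               # representation (5)

D = np.diag(R_pinv)
Y = (G_inv @ U) / np.sqrt(D)                   # partial correlation Z-scores (6)
P = Y.T @ Y                                    # representation (7)

print("R == U'U:        ", np.allclose(R, U.T @ U))
print("(5) matches pinv:", np.allclose(R_pinv, np.linalg.pinv(R, rcond=1e-10)))
print("diag(P) == 1:    ", np.allclose(np.diag(P), 1.0))
```

The only inverse taken is of an $(n-1) \times (n-1)$ matrix, which illustrates why the Z-score representation is attractive when $n$ is much smaller than $p$.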



3. Overview of results 



3.1. Phase transitions for hub discoveries. There is a phase transition in the number of
discovered variables as a function of the applied screening threshold(s). This phase transition
threshold, which we call $\rho_{c,\delta}$, is such that if the screening threshold $\rho$ decreases below $\rho_{c,\delta}$, the
mean number $E[N_{\delta,\rho}]$ of hub discoveries of degree $\delta$ abruptly increases to $p$. An expression
for the critical threshold is obtained from the asymptotic expression (18) for the mean given
in Prop. 2 in the Appendix

(8) $$\rho_{c,\delta} = \sqrt{1 - (c_{n,\delta}(p-1))^{-2\delta/(\delta(n-2)-2)}},$$

where $c_{n,\delta} = a_n \delta J_{p,\delta}$ and $a_n = 2B((n-2)/2, 1/2)$ with $B(i,j)$ the beta function. The unknown
covariance matrix $\boldsymbol{\Sigma}$ influences $\rho_{c,\delta}$ only through the quantity $J_{p,\delta} = J(\overline{f_{Z_{*1},\ldots,Z_{*(\delta+1)}}})$, defined
in (30), which is a measure of average $(\delta+1)$-order dependency among the Z-scores $\{\mathbf{Z}_i\}_{i=1}^{p}$.
For large $p$ the constant $c_{n,\delta}$ depends only weakly on $p$ and the critical threshold increases
to 1 at rate $O((p-1)^{-2\delta/(\delta(n-2)-2)})$, which is close to logarithmic in $p$ for large $n$ but much
faster than logarithmic for small $n$.
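
The dependence of the critical threshold on $p$ and $n$ can be explored with a few lines of Python. The sketch below evaluates expression (8) for a caller-supplied constant; it does not compute $c_{n,\delta}$ itself, since $J_{p,\delta}$ is defined outside this section. The function name and the placeholder value `c = 1.0` are assumptions for illustration and are not values taken from the text.

```python
import math

def critical_threshold(c, n, p, delta):
    """Evaluate (8): sqrt(1 - (c * (p - 1)) ** (-2*delta / (delta*(n - 2) - 2))).

    c stands for the constant c_{n,delta} and must be supplied by the caller.
    """
    denom = delta * (n - 2) - 2
    if denom <= 0:
        raise ValueError("need delta * (n - 2) > 2")
    base = c * (p - 1)
    if base <= 1:
        raise ValueError("need c * (p - 1) > 1 for a threshold in (0, 1)")
    return math.sqrt(1.0 - base ** (-2.0 * delta / denom))

# c = 1.0 is a placeholder constant, not a value from the text.
for n in (10, 100):
    values = [critical_threshold(1.0, n, p, delta=1) for p in (10**3, 10**4, 10**5)]
    print("n =", n, [round(v, 3) for v in values])
```

For fixed $c$ the threshold grows with $p$ and shrinks with $n$, in line with the rate statement above: with few samples the threshold is pushed close to one even for moderate $p$.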

When the rows of $\mathbb{X}$ are i.i.d. elliptically distributed and $\boldsymbol{\Sigma}$ is block-sparse of degree $k$
then, from Prop. 3

(9) $$J(\overline{f_{Z_{*1},\ldots,Z_{*(\delta+1)}}}) = 1 + O((k/p)^{\gamma_\delta}),$$

where $\gamma_\delta = \delta + 1$ for correlation hub screening and $\gamma_\delta = 1$ for partial correlation hub
screening. When $p$ is large $O((k/p)^{\gamma_\delta})$ goes to zero and the phase transition threshold $\rho_{c,\delta}$
becomes independent of $\boldsymbol{\Sigma}$.

3.2. Asymptotic p-values on hub discoveries. As $p \to \infty$, Prop. 2 in the Appendix
states that $N_{\delta,\rho}$ is dominated by an asymptotically Poisson random variable $N^*_{\delta,\rho}$ if: 1)
$\rho = \rho_p$ increases to one at a prescribed rate depending on $n$; 2) the sparsity degree increases
only as $k = o(p)$; and 3) the dependency coefficient $\|\Delta_{p,n,k,\delta}\|_1$, defined in (28), converges to
zero. This guarantees that $P(N_{\delta,\rho} > 0)$ converges to the Poisson probability $1 - \exp(-\Lambda_{\delta,\rho})$
where $\Lambda_{\delta,\rho}$ is the rate parameter of $N^*_{\delta,\rho}$. The rate of convergence is provided in Prop. 1
along with a finite $p$ approximation for $\Lambda_{\delta,\rho}$

$$\Lambda_{\delta,\rho} = \lambda_{\delta,\rho}\, J(\overline{f_{Z_{*1},\ldots,Z_{*(\delta+1)}}}),$$

with

(10) $$\lambda_{\delta,\rho} = p \binom{p-1}{\delta} \left(P_0(\rho, n)\right)^{\delta},$$

where $P_0(\rho, n)$ is the spherical cap probability defined in (23). The parameter $\lambda_{\delta,\rho}$ does not
depend on the underlying distribution of the observations.

Furthermore, when the rows of $\mathbb{X}$ are i.i.d. elliptically distributed with covariance that is
block-sparse of degree $k$, Prop. 3 asserts that (9) holds and that the dependency coefficient
$\|\Delta_{p,n,k,\delta}\|_1$ is equal to zero for correlation hub screening and of order $O(k/p)$ for partial
correlation hub screening. Therefore, when the block-sparse model is posed as the null
hypothesis, Prop. 2 implies that the false positive familywise error rate (FWER) can be
approximated by

(11) $$P(N_{\delta,\rho} > 0) \approx 1 - \exp(-\lambda_{\delta,\rho}).$$
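
The approximation (11) is straightforward to evaluate once $P_0(\rho, n)$ is available. Definition (23) is not reproduced in this section, so the sketch below makes an explicit assumption: $P_0(\rho, n)$ is taken to be the probability that a point distributed uniformly on the unit sphere in $\mathbb{R}^{n-1}$ has inner product of magnitude at least $\rho$ with a fixed unit vector, normalized so that $P_0(0, n) = 1$. The function names and the use of Simpson's rule are illustrative choices, and only the Python standard library is used.

```python
import math

def spherical_cap_prob(rho, n, steps=2000):
    """P(|u'z| >= rho) for z uniform on the unit sphere in R^(n-1), assuming n >= 4.

    Computed as (2/B) * integral from rho to 1 of (1 - u^2)^((n-4)/2) du with
    B = B((n-2)/2, 1/2), using composite Simpson's rule (steps must be even).
    """
    if n < 4:
        raise ValueError("this sketch assumes n >= 4")
    log_B = math.lgamma((n - 2) / 2) + math.lgamma(0.5) - math.lgamma((n - 1) / 2)
    f = lambda u: max(0.0, 1.0 - u * u) ** ((n - 4) / 2)
    h = (1.0 - rho) / steps
    s = f(rho) + f(1.0)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * f(rho + i * h)
    return 2.0 * math.exp(-log_B) * s * h / 3.0

def poisson_fwer(rho, n, p, delta):
    """1 - exp(-lambda) with lambda = p * C(p-1, delta) * P0(rho, n)^delta, as in (10)-(11)."""
    lam = p * math.comb(p - 1, delta) * spherical_cap_prob(rho, n) ** delta
    return -math.expm1(-lam)

# Illustrative values only: n = 20 samples, p = 1000 variables, degree 1.
for rho in (0.7, 0.8, 0.9):
    print(rho, poisson_fwer(rho, n=20, p=1000, delta=1))
```

Sweeping `rho` in this way shows how sharply the approximate familywise rate falls from near one to near zero as the threshold increases, which is the behaviour the phase transition analysis of Sec. 3.1 describes.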

Approximate p-values can also be obtained under the block-sparse null hypothesis. Assume
that for an initial threshold $\rho^* \in [0, 1]$ the sample correlation or partial correlation graph
$\mathcal{G}_{\rho^*}(\boldsymbol{\Phi})$ has been computed for $\boldsymbol{\Phi}$ equal to $\mathbf{R}$ or $\mathbf{P}$. Consider a vertex $i$ in this graph that
has degree $d_i > 0$. For each value $\delta \in \{1, \ldots, \max_{i=1,\ldots,p} d_i\}$ of the degree threshold $\delta$ denote
by $\rho_i(\delta)$ the maximum value of the correlation threshold $\rho$ for which this vertex maintains
degree at least $\delta$ in $\mathcal{G}_\rho(\boldsymbol{\Phi})$. $\rho_i(\delta) \geq \rho^*$ and is equal to the sample correlation between $\mathbf{X}_i$ and
its $\delta$-th nearest neighbor. We define the approximate p-value associated with discovery $i$ at
degree level $\delta$ as

(12) $$pv_\delta(i) = 1 - \exp(-\lambda_{\delta,\rho_i(\delta)}).$$

The quantity (12) is an approximation to the probability of observing at least one vertex
with degree greater than or equal to $\delta$ in $\mathcal{G}_\rho(\boldsymbol{\Phi})$ under the nominal block-sparse covariance
model.
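
The thresholds $\rho_i(\delta)$ are simply the sorted off-diagonal magnitudes in row $i$ of the screened matrix, which the following Python sketch makes explicit. The helper names, the toy matrix `phi`, and in particular the `toy_rate` function are hypothetical: `toy_rate` is a stand-in for $\lambda_{\delta,\rho}$ and is not the expression (10).

```python
import math

def degree_thresholds(phi, i, rho_star):
    """rho_i(delta) for delta = 1..d_i: the delta-th largest |phi[i][j]|, j != i, kept if >= rho_star."""
    mags = sorted((abs(v) for j, v in enumerate(phi[i]) if j != i), reverse=True)
    return [m for m in mags if m >= rho_star]

def p_value_trajectory(phi, i, rho_star, rate):
    """Approximate p-values (12) for vertex i; rate(delta, rho) must return lambda_{delta, rho}."""
    thresholds = degree_thresholds(phi, i, rho_star)
    return [-math.expm1(-rate(delta, rho)) for delta, rho in enumerate(thresholds, start=1)]

# Toy 4 x 4 correlation-type matrix (illustrative only).
phi = [[1.0, 0.9, -0.8, 0.1],
       [0.9, 1.0, 0.2, 0.3],
       [-0.8, 0.2, 1.0, 0.0],
       [0.1, 0.3, 0.0, 1.0]]

# Hypothetical placeholder for lambda_{delta, rho}; NOT the expression (10).
toy_rate = lambda delta, rho: 10.0 * (1.0 - rho) ** delta

print(degree_thresholds(phi, 0, rho_star=0.5))
traj = p_value_trajectory(phi, 0, rho_star=0.5, rate=toy_rate)
print(traj)
```

Vertex 0 in the toy matrix has degree 2 at the initial threshold, so its list of thresholds and its list of p-values both have two entries, one per degree level.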

The accuracy of the approximations of false positive rate (11) and p-value (12) is specified
by the bound (17) given in Prop. 1. Corollary 1 provides rates of convergence under the
assumptions that $p(p-1)^{\delta}(1-\rho^2)^{(n-2)/2} = O(1)$ and the rows of $\mathbb{X}$ are i.i.d. with sparse
covariance. For example, assume that the covariance is block-sparse of degree $k$. If $k$ does
not grow with $p$ then the rate of convergence of $P(N_{\delta,\rho} > 0)$ to its Poisson limit is no worse
than $O(p^{-1/\delta})$ for $\delta > n - 3$. On the other hand, if $k$ grows with rate at least $O(p^{1-\alpha})$, for
$\alpha = \min\{(\delta+1)^{-1}, (n-2)^{-1}\}/\delta$, the rate of convergence is no worse than $O(k/p)$. This
latter bound can be replaced by $O((k/p)^{\delta+1})$ for correlation hub screening under the less
restrictive assumption that the covariance is row-sparse.



More generally the combination of Prop. 1 and the assertions (Prop. 3) that
$J(\overline{f_{Z_{*1},\ldots,Z_{*(\delta+1)}}}) = 1 + O((k/p)^{\gamma_\delta})$ and $\|\Delta_{p,n,k,\delta}\|_1 \leq O(k/p)$ yields

$$|P(N_{\delta,\rho} > 0) - (1 - \exp(-\Lambda_{\delta,\rho}))| \leq
\begin{cases}
O\left(\max\left\{(k/p)^{\gamma_\delta},\ p^{-(\delta-1)/\delta}(k/p)^{\delta-1},\ p^{-1/\delta},\ (1-\rho)^{1/2}\right\}\right), & \delta > 1 \\
O\left(\max\left\{(k/p)^{\gamma_\delta},\ (1-\rho)^{1/2}\right\}\right), & \delta = 1
\end{cases}$$

The terms $(k/p)^{\gamma_\delta}$, $p^{-(\delta-1)/\delta}(k/p)^{\delta-1}$, $p^{-1/\delta}$ and $(1-\rho)^{1/2}$ respectively quantify the
contribution of errors due to: 1) insufficient sparsity in the covariance or, equivalently, the correlation
graph; 2) excessive dependency among neighbor variables in this graph; 3) poor convergence
of $E[N_{\delta,\rho}]$; and 4) inaccurate mean-value approximation of the integral representation of
$\lim_{p \to \infty} E[N_{\delta,\rho}]$ by (35). One of these terms will dominate depending on the regime of
operation. For example, specializing to correlation hub screening ($\gamma_\delta = \delta + 1$), if $\delta > 1$
and $O(p^{(\delta+1)/(2\delta)}) \leq k \leq o(p)$ then $(k/p)^{\delta+1} > p^{-(\delta-1)/\delta}(k/p)^{\delta-1}$ and the deficiency in the
Poisson probability approximation will not be the determining factor on convergence rate.

3.3. P-value trajectories and waterfall plots. As defined above, the approximate p-values
provide a useful statistical summary. Rank ordering and thresholding the list $\{pv_\delta(i)\}_{i=1}^{p}$
of p-values at any level $\alpha \in [0, 1]$ yields the set of vertices of degree at least $\delta$ that would
pass a test of significance at false positive FWER level $\alpha$. Additional useful information can
be gleaned by graphically rendering the aggregate lists of p-values as described below.

Specifically, assume that correlation screening generates an associated family of graphs
$\{\mathcal{G}_\rho(\boldsymbol{\Phi})\}_{\rho \in [\rho^*, 1]}$. Let $d_{\max} = \max_{i=1,\ldots,p} d_i$ be the maximum discovered degree in the initial
graph $\mathcal{G}_{\rho^*}(\boldsymbol{\Phi})$. We define the waterfall plot of p-values as the family of curves, plotted
against the thresholds $\rho_i(\delta)$, indexed by $\delta \in \{1, \ldots, d_{\max}\}$ where the $\delta$-th curve is formed from
the (linearly interpolated) rank ordered list of p-values $\{pv_\delta(i_j)\}_{j=1}^{p}$, $pv_\delta(i_1) \geq \ldots \geq pv_\delta(i_p)$
(see Fig. 2).

A useful visualization of the evolution of vertex neighborhoods over the family $\{\mathcal{G}_\rho(\boldsymbol{\Phi})\}_{\rho \in [\rho^*, 1]}$
is the p-value trajectory over the waterfall plot. This trajectory is defined as the ordered
list $\{pv_\delta(i)\}_{\delta=1}^{d_i}$ defined for a given vertex $i$. All p-value trajectories start at the outermost
curve (curve associated with $\delta = 1$) on the waterfall plot and extinguish at some inner curve
(associated with increasing $\delta > 1$). Vertices with the tightest large neighborhoods will tend
to have long trajectories that start in the middle of the outer curve and extinguish at a
curve deep in the waterfall plot, e.g., the variable labeled ARRB2 in Fig. 2, while vertices
with the tightest small neighborhoods will tend to have short trajectories that start near
the extremal point of the outer curve, e.g., the variable labeled CTAG2 in Fig. 2 whose
trajectory extinguishes near the bottom right corner of the waterfall plot.



4. Main theorems 

The asymptotic theory for hub discovery in correlation and partial correlation graphs is
presented in the form of three propositions and one corollary. Prop. 1 gives a general bound on
the finite sample approximation error associated with the approximation of the mean and
probability of discoveries given in Prop. 2. The results of Props. 1 and 2 apply to general random
matrices of the form $\mathbb{Z}^T \mathbb{Z}$ where the $p$ columns of $\mathbb{Z}$ lie on the unit sphere $S_{n-2} \subset \mathbb{R}^{n-1}$ and,
in view of (4) and (7), they provide a unified theory of hub screening for correlation graphs
and partial correlation graphs. Corollary 1 specializes the bounds presented in Prop. 1 to
the case of sparse correlation graphs using Prop. 3.

For $\delta \geq 1$, $\rho \in [0, 1]$, and $\boldsymbol{\Phi}$ equal to the sample correlation matrix $\mathbf{R}$ or the sample partial
correlation matrix $\mathbf{P}$ we recall the definition of $N_{\delta,\rho}$ as the number of vertices of degree at
least $\delta$ in $\mathcal{G}_\rho(\boldsymbol{\Phi})$. Define $N^*_{\delta,\rho}$ as the count of the number of groups of $\delta$ mutually coincident



HUB DISCOVERY IN PARTIAL CORRELATION GRAPHICAL MODELS 9 

edges in $\mathcal{G}_\rho(\boldsymbol{\Phi})$.$^{2}$ In the sequel we will use the key property that $N_{\delta,\rho} = 0$ if and only if
$N^*_{\delta,\rho} = 0$.
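
The relation between the two counts can be illustrated from a degree sequence alone. The Python sketch below reads "mutually coincident edges" as edges sharing a common vertex, following the star-subgraph description given in the accompanying footnote; that reading, the function names, and the example degree sequence are assumptions made for illustration.

```python
import math

def hub_count(degrees, delta):
    """N: number of vertices with degree at least delta."""
    return sum(d >= delta for d in degrees)

def star_count(degrees, delta):
    """N*: number of groups of delta edges that share a common vertex.

    A vertex of degree d contributes C(d, delta) such groups. For delta = 1
    every edge is counted from both of its endpoints, hence the division by 2.
    """
    total = sum(math.comb(d, delta) for d in degrees)
    return total // 2 if delta == 1 else total

# Degree sequence of a star with three leaves plus one isolated vertex.
degrees = [3, 1, 1, 1, 0]
for delta in range(1, 5):
    print(delta, hub_count(degrees, delta), star_count(degrees, delta))
```

Both counts vanish for exactly the same values of $\delta$, which is the key property used in the sequel, and the factor of two at $\delta = 1$ mirrors the function $\varphi(\delta)$ that appears in the Poisson rate.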

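For $\delta \geq 1$, $\rho \in [0, 1]$, and $n > 2$ define

(13) $$\Lambda = \xi_{p,n,\delta,\rho}\, J(\overline{f_{Z_{*1},\ldots,Z_{*(\delta+1)}}}),$$

where

(14) $$\xi_{p,n,\delta,\rho} = p \binom{p-1}{\delta} P_0^{\delta},$$

$P_0 = P_0(\rho, n)$ is defined in (23), $J$ is given in (30), and $\overline{f_{Z_{*1},\ldots,Z_{*(\delta+1)}}}$ is the average joint density
given in (27).

We also define the following quantity needed for the bounds of Prop. 1

(15) $$\eta_{p,\delta} = p^{1/\delta}(p-1)P_0.$$

Note that $\xi_{p,n,\delta,\rho}/\eta_{p,\delta}^{\delta} = (a_n/(n-2))^{\delta}/\delta!$ to order $O(\max\{p^{-1}, 1-\rho\})$, where
$a_n = (2\Gamma((n-1)/2))/(\sqrt{\pi}\,\Gamma((n-2)/2))$. Let $\varphi(\delta)$ be the function equal to 1 for $\delta > 1$ and equal to 2 for
$\delta = 1$.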

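Proposition 1. Let $\mathbb{Z} = [\mathbf{Z}_1, \ldots, \mathbf{Z}_p]$ be a $(n-1) \times p$ random matrix with $\mathbf{Z}_i \in S_{n-2}$. Fix
integers $\delta$ and $n$ where $\delta \geq 1$ and $n > 2$. Let the joint density of any subset of the $\mathbf{Z}_i$'s be
bounded and differentiable. Then, with $\Lambda$ defined in (13),

(16) $$|E[N_{\delta,\rho}] - \Lambda| \leq O\left(\eta_{p,\delta}^{\delta} \max\left\{\eta_{p,\delta}\, p^{-1/\delta},\ (1-\rho)^{1/2}\right\}\right).$$

Furthermore, let $N^*_{\delta,\rho}$ be a Poisson distributed random variable with rate $E[N^*_{\delta,\rho}] = \Lambda/\varphi(\delta)$.
If $(p-1)P_0 \leq 1$ then, for any integer $k$, $1 \leq k \leq p$,

(17) $$|P(N_{\delta,\rho} > 0) - P(N^*_{\delta,\rho} > 0)| \leq
\begin{cases}
O\left(\eta_{p,\delta}^{\delta} \max\left\{\eta_{p,\delta}^{\delta}(k/p)^{\delta+1},\ Q_{p,k,\delta},\ \|\Delta_{p,n,k,\delta}\|_1,\ p^{-1/\delta},\ (1-\rho)^{1/2}\right\}\right), & \delta > 1 \\
O\left(\eta_{p,1} \max\left\{\eta_{p,1}(k/p)^{2},\ \|\Delta_{p,n,k,1}\|_1,\ p^{-1},\ (1-\rho)^{1/2}\right\}\right), & \delta = 1
\end{cases}$$

with $Q_{p,k,\delta} = \eta_{p,\delta}^{\delta-1}\, p^{-(\delta-1)/\delta} (k/p)^{\delta-1}$ and $\|\Delta_{p,n,k,\delta}\|_1$ defined in (29).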

The proof of the above proposition is given in the Appendix. The Poisson-type limit
(19) is established by showing that the count $N^*_{\delta,\rho}$ of the number of groups of $\delta$ mutually
coincident edges in $\mathcal{G}_\rho$ converges to a Poisson random variable with rate $\Lambda/\varphi(\delta)$.

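Proposition 2. Let $\rho_p \in [0, 1]$ be a sequence converging to one as $p \to \infty$ such that
$p^{1/\delta}(p-1)(1-\rho_p^2)^{(n-2)/2} \to e_{n,\delta} \in (0, \infty)$. Then

(18) $$\lim_{p \to \infty} E[N_{\delta,\rho_p}] = \kappa_{n,\delta} \lim_{p \to \infty} J(\overline{f_{Z_{*1},\ldots,Z_{*(\delta+1)}}}),$$

where $\kappa_{n,\delta} = (e_{n,\delta}\, a_n/(n-2))^{\delta}/\delta!$. Assume that $k = o(p)$ and that the weak dependency
condition $\lim_{p \to \infty} \|\Delta_{p,n,k,\delta}\|_1 = 0$ is satisfied. Then

(19) $$P(N_{\delta,\rho_p} > 0) \to 1 - \exp(-\Lambda/\varphi(\delta)).$$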



$^{2}$ $N^*_{\delta,\rho}$ is equivalent to the number of subgraphs in $\mathcal{G}_\rho$ that are isomorphic to a star graph with $\delta$ vertices.

Prop. 2 is an immediate consequence of Prop. 1, and its proof is omitted.

Propositions 1 and 2 are general results that apply to both correlation hub and partial 
correlation hub screening under a wide range of conditions. Corollary 1 specializes these 
results to the case of sparse covariance and i.i.d. rows of X having elliptical distribution. 
These are standard conditions assumed in the literature on graphical models. 

Corollary 1. In addition to the hypotheses of Prop. 2 assume that $n > 3$ and that the rows
of $\mathbb{X}$ are i.i.d. elliptically distributed with a covariance matrix $\boldsymbol{\Sigma}$ that is row-sparse of degree
$k$. Assume that $k$ grows as $O(p^{1-\alpha}) \leq k \leq o(p)$ where $\alpha = \min\{(\delta+1)^{-1}, (n-2)^{-1}\}/\delta$.
Then, for correlation hub screening the asymptotic approximation error in the limit (19)
is upper bounded by $O((k/p)^{\delta+1})$. Under the additional assumption that the covariance is
block-sparse, for partial correlation hub screening this error is upper bounded by $O(k/p)$.

The proof of Corollary 1 is given in the Appendix. The proposition below specializes these 
results to sparse covariance. 

Proposition 3. Let $\mathbb{X}$ be an $n \times p$ data matrix whose rows are i.i.d. realizations of an
elliptically distributed p-dimensional vector $\mathbf{X}$ with mean parameter $\boldsymbol{\mu}$ and covariance parameter
$\boldsymbol{\Sigma}$. Let $\mathbb{U} = [\mathbf{U}_1, \ldots, \mathbf{U}_p]$ be the matrix of correlation Z-scores (4) and $\mathbb{Y} = [\mathbf{Y}_1, \ldots, \mathbf{Y}_p]$ be
the matrix of partial correlation Z-scores (6) defined in Sec. 2.2. Assume that the covariance
matrix $\boldsymbol{\Sigma}$ is block-sparse of degree $q$. Then the pseudo-inverse partial correlation matrix
$\mathbf{P} = \mathbb{Y}^T \mathbb{Y}$ has the representation

(20) $$\mathbf{P} = \mathbb{U}^T \mathbb{U} (1 + O(q/p)).$$

Let $\mathbf{Z}_i$ denote $\mathbf{U}_i$ or $\mathbf{Y}_i$ and assume that for $\delta \geq 1$ the joint density of any distinct set of
Z-scores $\mathbf{U}_{i_1}, \ldots, \mathbf{U}_{i_{\delta+1}}$ is bounded and differentiable over $S_{n-2}^{\delta+1}$. Then the $(\delta+1)$-fold average
function $J$ (27) and the dependency coefficient $\Delta_{p,n,k,\delta}$ (29) satisfy

(21) $$J(\overline{f_{Z_{*1},\ldots,Z_{*(\delta+1)}}}) = 1 + O((q/p)^{\gamma_\delta}),$$

(22) $$\|\Delta_{p,n,k,\delta}\|_1 = \begin{cases} 0, & \phi = 0 \\ O(q/p), & \phi = 1 \end{cases}$$

where $\phi = 0$ and $\phi = 1$ for correlation and partial correlation hub screening, respectively,
and $\gamma_\delta = \phi + (\delta + 1)(1 - \phi)$.

Proof of Proposition 3: 

The proof of Proposition 3 is given in the Appendix. □ 

5. Experiments 

5.1. Numerical simulation study. Figure 1 shows the waterfall plot of partial correlation
hub p-values for a sham measurement matrix with i.i.d. normal entries that emulates the NKI
experimental data studied in the next subsection. There are $n = 266$ samples and $p = 24{,}481$
variables in this sham. Using (8) with parameter $c_{n,\delta} = a_n \delta$, the critical phase transition
threshold on discoveries with positive vertex degree was determined to be $\rho_{c,1} = 0.296$.
For purposes of illustration of the fidelity of our theoretical predictions we used an initial
screening threshold equal to $\rho^* = 0.26$. As this is a sham, all discoveries are false positives.

Figure 1. Waterfall plots of p-values for partial correlation hub screening of a
sham version of the NKI dataset [4]. The data matrix $\mathbb{X}$ has $n = 266$ rows and
$p = 24{,}481$ columns and is populated with i.i.d. zero mean and unit variance
Gaussian variables ($\boldsymbol{\Sigma} = \mathbf{I}$). Dots on the curves correspond to unnormalized
log p-values (called $\lambda$-values) of discovered variables whose partial correlations
exceed the initial screening threshold $\rho^* = 0.26$. Each curve is indexed over a
particular vertex degree threshold parameter $\delta$ ranging from $\delta = 1$ to $\delta = 5$, the
maximum vertex degree found. Legends indicate the total number of variables
discovered having vertex degree $d$. Left: plot of $\lambda_{\delta,\rho_i(\delta)}(i) = -\log(1 - pv_\delta(i))$.
Right: same as left panel but $\lambda$-values are plotted on log scale.

To illustrate the fidelity of the theoretical predictions waterfall plots of p-values (12) are
shown in Fig. 1. For clarity, the figure shows the $\lambda$-value defined as
$\lambda_{\delta,\rho_i(\delta)} = -\log(1 - pv_\delta(i))$. When presented in this manner the leftmost point of each curve in the waterfall
plot occurs at approximately $(\rho^*, E[N_{\delta,\rho^*}])$, as can be seen by comparing the second and
third columns of Table 1. This table demonstrates strong agreement between the predicted
(mean) number of partial correlation hub discoveries and the actual number of discoveries for
a single realization of the data matrix. The realization shown in the table is representative
of the many simulations performed by us.



5.2. Parcor screening of NKI dataset. The Netherlands Cancer Institute (NKI) dataset 
[4] contains data from Affymetrix GeneChips collected from 295 subjects who were diagnosed 
with early stage breast cancer. In Peng et al. [16] a graphical lasso method for estimating
the partial correlation graph was proposed and was applied to this dataset. Here we apply 
partial correlation hub screening. 

As in Peng etal [16] we only used a subset of the available GeneChip samples. Specifically, 
since 29 of the 295 GeneChips had variables with missing values, only 266 of the them were 
used in our analysis. Each GeneChip sample in the NKI dataset contains the expression levels 
observed degree          # predicted ($E[N_{\delta,\rho^*}]$)    # actual ($N_{\delta,\rho^*}$)
$d_i \ge \delta = 1$     8531                                    8492
$d_i \ge \delta = 2$     1697                                    1635
$d_i \ge \delta = 3$     234                                     229
$d_i \ge \delta = 4$     24                                      28
$d_i \ge \delta = 5$     2                                       4



Table 1. Fidelity of the predicted (mean) number of false positives and the 
observed number of false positives in the realization of the sham NKI dataset 
experiment shown in Fig. 1 







Figure 2. Waterfall plot of p-values for NKI gene expression dataset of [4]
plotted in terms of $\log\log(1 - pv_\delta(i))^{-1}$. The genes plotted correspond to
vertices of positive degree in the initial partial correlation graph with threshold
$\rho^* = 0.35$. Each curve indexes the p-values for a particular degree threshold
$\delta$ and a gene is on the curve if its degree $d_i$ in the initial graph is greater
than or equal to $\delta$. The discovered vertex degree ranges from 1 to 58 (last dot
labeled IL14 at bottom left). The p-value trajectories across vertex degree $\delta$
are indicated for several genes of interest. Note that all three Affymetrix array
replicates of IGL@ were discovered.
Figure 3. Same as Fig. 2 but showing the p-value trajectories of 6 of the
hub genes ('BUB1' 'CENPA' 'KNSL6' 'STK12' 'RAD54L' 'ID-GAP') reported
in Peng et al. [16]. These are the genes reported in Table 4 of [16] that have
unambiguous annotation on the GeneChip array.



of 24,481 genes. Peng et al. [16] applied univariate Cox regression to reduce the number of
variables to 1,217 genes prior to applying their sparse partial correlation estimation (space)
method. In contrast, we applied our partial correlation hub screening procedure directly to
all 24,481 variables.

An initial threshold $\rho^* = 0.35 > \rho_{1,c} = 0.296$ was selected. Figure 2 illustrates the
waterfall plot of p-values of all discovered variables. Note in particular the very high level of
significance of certain variables at the lower extremities of the p-value curves. According to
NCBI Entrez several of the most statistically significant discovered genes on these strands
have been related to breast cancer, lymphoma, and immune response, e.g. ARRB2 (Arrestin,
Beta 2), CTAG2 (Cancer/testis antigen), IL14 (Interleukin), and IGL@ (Immunoglobin alpha).
The p-value trajectories (colored labels) of these four genes across different values of $\delta$
are illustrated in the figure. Note that some genes are statistically significant only at low
vertex degree (CTAG2) while others retain high statistical significance across all vertex degrees
(IGL@). Fig. 3 is the same plot with the trajectories of the 6 unambiguously annotated hub
genes given in Table 4 of Peng et al. [16]. While these 6 genes do not attain nearly as high
a level of significance, or as high a partial correlation, as other genes shown in Fig. 2, their
p-values are still very small: less than $10^{-25}$.
6. Conclusions

We treated the problem of screening for variables that are strongly correlated hubs in
a correlation or partial correlation graph when n < p and p is large. The proposed hub
screening procedure thresholds the sample correlation matrix or the pseudo-inverse of the sample
correlation matrix using Z-score representations of the correlation and partial correlation
matrices. For large p and finite n, asymptotic limits that specify the probability of false
hub discoveries were established. These limits were used to obtain expressions for phase
transition thresholds and p-values under the assumption of a block-sparse covariance matrix.
To illustrate the applicability of our hub screening results we applied them to the NKI breast
cancer gene expression dataset.
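The thresholding and degree-counting steps summarized above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the helper names `correlation_matrix` and `screen_hubs` and the 4 x 3 data matrix are hypothetical, and the sketch screens the ordinary sample correlation matrix. The same `screen_hubs` routine could be applied to a sample partial correlation matrix instead.

```python
import math

def correlation_matrix(X):
    """Sample correlation matrix of the columns of X (n rows, p columns).

    Assumes no column is constant, otherwise the normalization divides by zero.
    """
    n, p = len(X), len(X[0])
    unit = []
    for j in range(p):
        col = [row[j] for row in X]
        mean = sum(col) / n
        centered = [v - mean for v in col]
        norm = math.sqrt(sum(v * v for v in centered))
        unit.append([v / norm for v in centered])  # centered, unit-norm column
    return [[sum(a * b for a, b in zip(unit[i], unit[j])) for j in range(p)]
            for i in range(p)]

def screen_hubs(R, rho, delta):
    """Degrees in the graph thresholded at rho, and vertices with degree >= delta."""
    p = len(R)
    degrees = [sum(1 for j in range(p) if j != i and abs(R[i][j]) > rho)
               for i in range(p)]
    hubs = [i for i, d in enumerate(degrees) if d >= delta]
    return degrees, hubs

# Hypothetical data: columns 0 and 1 are collinear, column 2 is orthogonal to both
X = [[1.0, 2.0, 1.0], [2.0, 4.0, -1.0], [3.0, 6.0, -1.0], [4.0, 8.0, 1.0]]
degrees, hubs = screen_hubs(correlation_matrix(X), rho=0.9, delta=1)
```

For realistic p the quadratic loop over vertex pairs would be replaced by a vectorized or approximate nearest-neighbor computation; the sketch only fixes the logic of the two user-specified thresholds.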
7. Appendix

This appendix contains two subsections. Section 7.1 gives the necessary definitions.
Section 7.2 gives proofs of the theory given in Sec. 4.

7.1. Notation, Preliminaries and Definitions. 

• $X$: $n \times p$ matrix of observations.

• $Z = [Z_1, \ldots, Z_p]$: $(n-1) \times p$ matrix of correlation (or partial correlation) Z-scores
$\{Z_j\}_j$ associated with $X$.

• $Z^T Z$: $p \times p$ sample correlation matrix $R$ (if $Z = U$) or sample partial correlation
matrix $P$ (if $Z = Y$) associated with $X$.

• $Z_i^T Z_j$: sample correlation (or partial correlation) coefficient, the $i,j$-th element of
$Z^T Z$.

• $\rho \in [0,1]$: screening threshold applied to matrix $Z^T Z$.

• $r = \sqrt{2(1-\rho)}$: spherical cap radius parameter.

• $S_{n-2}$: unit sphere in $\mathbb{R}^{n-1}$.

• $a_n = |S_{n-2}|$: surface area of $S_{n-2}$.

• $\mathcal{G}_0(\Phi)$: graph associated with population correlation matrix $\Phi = \Gamma$ or partial
correlation matrix $\Phi = \Omega$. An edge in $\mathcal{G}_0(\Phi)$ corresponds to a non-zero entry of $\Phi$.

• $\mathcal{G}_\rho = \mathcal{G}_\rho(\Phi)$: graph associated with thresholded sample correlation matrix $\Phi = R$ or
partial correlation matrix $\Phi = P$. Specifically, the edges of $\mathcal{G}_\rho(\Phi)$ are specified by
the non-diagonal entries of $Z^T Z$ whose magnitudes exceed level $\rho$.

• $d_i$: observed degree of vertex $i$ in $\mathcal{G}_\rho(\Phi)$, $\Phi \in \{R, P\}$.

• $\delta$: screening threshold for vertex degrees in $\mathcal{G}_\rho(\Phi)$, $\Phi \in \{R, P\}$.

• $k$: upper bound on vertex degrees of $\mathcal{G}_0(\Phi)$, $\Phi \in \{\Gamma, \Omega\}$.

• $N_{\delta,\rho}$: generic notation for the number of correlation hub discoveries ($N_{\delta,\rho}(R)$) or
partial correlation hub discoveries ($N_{\delta,\rho}(P)$) of degree $d_i \ge \delta$ in $\mathcal{G}_\rho(R)$, or $\mathcal{G}_\rho(P)$,
respectively.

• $\tilde N_{\delta,\rho}$: the number of subsets of $\delta$ mutually coincident edges in $\mathcal{G}_\rho$.

• $A(r,z)$: the union of two anti-polar spherical cap regions in $S_{n-2}$ of radii
$r = \sqrt{2(1-\rho)}$ centered at points $-z$ and $z$.

• $P_0$: probability that a uniformly distributed vector $U \in S_{n-2}$ falls in $A(r,z)$:

(23) $P_0 = P_0(\rho,n) = a_n \int_\rho^1 (1-u^2)^{\frac{n-4}{2}} \, du = (n-2)^{-1} a_n (1-\rho^2)^{(n-2)/2} (1 + O(1-\rho^2))$,

where $a_n = 2B((n-2)/2, 1/2)$ and $B(l,m)$ is the beta function.
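The cap probability $P_0$ and its large-$\rho$ approximation can be compared numerically. The sketch below is illustrative only: the function names are hypothetical, a simple midpoint rule stands in for a proper quadrature routine, $n \ge 4$ is assumed, and the integral is normalized by the constant $2/B((n-2)/2, 1/2)$ so that the result is a probability (equal to 1 at $\rho = 0$). That normalization is an assumption of this sketch.

```python
import math

def beta(a, b):
    """Beta function via log-gamma, which avoids overflow for large arguments."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def cap_probability(rho, n, steps=20000):
    """P(|u| > rho), u being one coordinate of a uniform vector on S_{n-2} (n >= 4)."""
    expo = (n - 4) / 2.0
    h = (1.0 - rho) / steps
    # Midpoint rule for the integral of (1 - u^2)^((n-4)/2) over [rho, 1]
    integral = h * sum((1.0 - (rho + (k + 0.5) * h) ** 2) ** expo
                       for k in range(steps))
    return 2.0 * integral / beta((n - 2) / 2.0, 0.5)

def cap_probability_asymptotic(rho, n):
    """Leading term (n-2)^{-1} c (1 - rho^2)^((n-2)/2), with c = 2 / B((n-2)/2, 1/2)."""
    c = 2.0 / beta((n - 2) / 2.0, 0.5)
    return c * (1.0 - rho ** 2) ** ((n - 2) / 2.0) / (n - 2)

# Ratio of exact to asymptotic value for a hypothetical threshold near 1
ratio = cap_probability(0.9, 10) / cap_probability_asymptotic(0.9, 10)
```

For $n = 4$ the coordinate of a uniform point on the sphere is itself uniform on $[-1,1]$, so the probability reduces to $1 - \rho$, which gives a quick sanity check of the normalization.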

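For given integer $k$, $0 \le k < p$, and $\Phi$ either the population correlation matrix $\Gamma$ or the
population partial correlation matrix $\Omega$ define

(24) $\mathcal{N}_k(i) = \mathrm{argmax}_{j_1 \neq \cdots \neq j_{\min(k,d_i)}} \sum_{l=1}^{\min(k,d_i)} |\Phi_{i j_l}|$,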
where $d_i$ denotes the degree of vertex $i$ in $\mathcal{G}_0(\Phi)$ and the maximization is over the range of
distinct $j_l \in \{1, \ldots, p\}$ that are not equal to $i$. When $k \ge d_i$ these are the indices of the $d_i$
neighbors of vertex $i$ in $\mathcal{G}_0(\Phi)$. When $k < d_i$ these are the subset of the $k$-nearest neighbors
($k$-NN) of vertex $i$. For the sequel it will be convenient to define the following vector valued
indexing variable: $\vec{i} = (i_0, \ldots, i_\delta)$, where $0 \le \delta < p$ and $i_0, \ldots, i_\delta$ are distinct integers in
$\{1, \ldots, p\}$. With this index denote by $Z_{\vec{i}}$ the set of $\delta + 1$ Z-scores $\{Z_{i_j}\}_{j=0}^{\delta}$.

Define the set of complementary $k$-NN's of $Z_{\vec{i}}$ as $Z_{A_k(\vec{i})} = \{Z_l : l \in A_k(\vec{i})\}$, where

(25) $A_k(\vec{i}) = \left( \bigcup_{l=0}^{\delta} \mathcal{N}_k(i_l) \right)^c - \{\vec{i}\}$,

with $A^c$ denoting set complement of set $A$. The complementary $k$-NN's include vertices
outside of the $k$-nearest-neighbor regions of the set of points $Z_{\vec{i}}$.

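Define the $\delta$-fold leave-one-out average of the density, a function of $i$,

(26) $\overline{f_{Z_{*_1-i},\ldots,Z_{*_\delta-i},Z_i}}(z_1,\ldots,z_\delta,z_i) = \left( 2^\delta \binom{p-1}{\delta} \right)^{-1} \sum_{s_1,\ldots,s_\delta \in \{-1,1\}} \ \sum_{i_1 < \cdots < i_\delta} f_{Z_{i_1},\ldots,Z_{i_\delta},Z_i}(s_1 z_1,\ldots,s_\delta z_\delta,z_i)$,

where in the inner summation, indices $i_1, \ldots, i_\delta$ range over $\{1, \ldots, p\}$. Also define the
$(\delta+1)$-fold average of the same density

(27) $\overline{f_{Z_{*_1},\ldots,Z_{*_{\delta+1}}}}(z_1,\ldots,z_\delta,z_i) = p^{-1} \sum_{i=1}^{p} \left( \tfrac{1}{2} \overline{f_{Z_{*_1-i},\ldots,Z_{*_\delta-i},Z_i}}(z_1,\ldots,z_\delta,z_i) + \tfrac{1}{2} \overline{f_{Z_{*_1-i},\ldots,Z_{*_\delta-i},Z_i}}(z_1,\ldots,z_\delta,-z_i) \right)$.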

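For any data matrix $Z$ define the dependency coefficient between the columns $Z_{\vec{i}}$ and their
complementary $k$-NN's

(28) $\Delta_{p,n,k,\delta}(\vec{i}) = \left\| \left( f_{Z_{\vec{i}} | Z_{A_k(\vec{i})}} - f_{Z_{\vec{i}}} \right) / f_{Z_{\vec{i}}} \right\|_\infty$,

and the average of these coefficients is

(29) $\|\Delta_{p,n,k,\delta}\|_1 = \left( p \binom{p-1}{\delta} \right)^{-1} \sum_{i_0=1}^{p} \ \sum_{i_1 < \cdots < i_\delta} \Delta_{p,n,k,\delta}(\vec{i})$,

where the second sum is indexed over $i_1, \ldots, i_\delta \neq i_0$.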

The coefficient (29) quantifies weak dependence of the Z-scores. If, for all $\vec{i}$, $Z_{\vec{i}}$ and its
complementary $k$-NN neighborhood variables are independent then $\|\Delta_{p,n,k,\delta}\|_1 = 0$. When
the rows of $X$ are i.i.d. and elliptically distributed, and $Z = U$ are the standard correlation
Z-scores, then a sufficient condition for $\|\Delta_{p,n,k,\delta}\|_1 = 0$ is that $\mathcal{G}_0(\Phi)$ have no vertex of degree
greater than $k$ or, equivalently, that the dispersion matrix $\Sigma$ be row sparse of degree $k$.

Finally, for arbitrary joint density $f_{Z_1,\ldots,Z_\delta}(z_1, \ldots, z_\delta)$ on $S_{n-2}^\delta = \times_{i=1}^{\delta} S_{n-2}$, define

(30) $J(f_{Z_1,\ldots,Z_\delta}) = |S_{n-2}|^{\delta-1} \int_{S_{n-2}} f_{Z_1,\ldots,Z_\delta}(z, \ldots, z) \, dz$.
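As a short worked instance of definition (30), consider the case in which $Z_1, \ldots, Z_\delta$ are independent and uniformly distributed on $S_{n-2}$ (an illustrative special case, stated here only to fix the normalization). The joint density is then constant and the functional evaluates to one:

```latex
f_{Z_1,\ldots,Z_\delta}(z_1,\ldots,z_\delta) = |S_{n-2}|^{-\delta}
\quad\Longrightarrow\quad
J(f_{Z_1,\ldots,Z_\delta})
  = |S_{n-2}|^{\delta-1} \int_{S_{n-2}} |S_{n-2}|^{-\delta} \, dz
  = |S_{n-2}|^{\delta-1} \, |S_{n-2}|^{-\delta} \, |S_{n-2}|
  = 1 .
```

The factor $|S_{n-2}|^{\delta-1}$ in (30) is therefore exactly what is needed for $J$ to equal 1 under independence and uniformity.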
7.2. Proofs of theorems. Proof of Prop. 1:

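With $\phi_i = I(d_i \ge \delta)$ we have $N_{\delta,\rho} = \sum_{i=1}^{p} \phi_i$. Define $\phi_{ij} = I(Z_j \in A(r, Z_i))$, the indicator
of the presence of an edge in $\mathcal{G}_\rho(\Phi)$ between vertices $i$ and $j$, where $A(r, Z_i)$ is the union of
two antipolar caps in $S_{n-2}$ of radius $r = \sqrt{2(1-\rho)}$ centered at $Z_i$ and $-Z_i$, respectively.
Then $\phi_i$ and $\phi_{ij}$ have the explicit relation

(31) $\phi_i = \sum_{l=\delta}^{p-1} \ \sum_{\vec{k} \in C_i(p-1,l)} \ \prod_{j=1}^{l} \phi_{ik_j} \prod_{q=l+1}^{p-1} (1 - \phi_{ik_q})$,

where we have defined the index vector $\vec{k} = (k_1, \ldots, k_{p-1})$ and the set

$C_i(p-1,l) = \{ \vec{k} : k_1 < \cdots < k_l, \ k_{l+1} < \cdots < k_{p-1}, \ k_j \in \{1, \ldots, p\} - \{i\}, \ k_j \neq k_{j'} \}$.

The inner summation in (31) simply sums over the set of distinct indices not equal to $i$
that index all $\binom{p-1}{l}$ different types of products $\prod_{j=1}^{l} \phi_{ik_j} \prod_{q=l+1}^{p-1} (1 - \phi_{ik_q})$. Subtracting
$\sum_{\vec{k} \in C_i(p-1,\delta)} \prod_{j=1}^{\delta} \phi_{ik_j}$ from both sides of (31),

(32) $\phi_i - \sum_{\vec{k} \in C_i(p-1,\delta)} \prod_{j=1}^{\delta} \phi_{ik_j}$

(33) $= \sum_{l=\delta+1}^{p-1} \ \sum_{\vec{k} \in C_i(p-1,l)} \ \prod_{j=1}^{l} \phi_{ik_j} \prod_{q=l+1}^{p-1} (1 - \phi_{ik_q})$

(34) $+ \sum_{\vec{k} \in C_i(p-1,\delta)} \ \sum_{m=\delta+1}^{p-1} (-1)^{m-\delta} \sum_{k'_{\delta+1} < \cdots < k'_m} \ \prod_{j=1}^{\delta} \phi_{ik_j} \prod_{n=\delta+1}^{m} \phi_{ik'_n}$,

where, in the last line, the indices $k'_{\delta+1} < \cdots < k'_m$ range over subsets of $\{k_{\delta+1}, \ldots, k_{p-1}\}$ and we have used the expansion

$\prod_{m=l+1}^{p-1} (1 - \phi_{ik_m}) = 1 + \sum_{m=l+1}^{p-1} (-1)^{m-l} \sum_{k'_{l+1} < \cdots < k'_m} \ \prod_{n=l+1}^{m} \phi_{ik'_n}$.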
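The decomposition of the hub indicator into products of edge indicators can be checked by brute force on a small example. The sketch below is only a sanity check under stated assumptions: the function names are hypothetical, `phi` stands for the vector of edge indicators between a fixed vertex and the other $m = p-1$ vertices, and the sum is taken over neighbor subsets of size at least `delta`.

```python
from itertools import combinations, product

def hub_indicator_expansion(phi, delta):
    """Sum over subsets S with |S| >= delta of prod_{j in S} phi_j * prod_{j not in S} (1 - phi_j)."""
    m = len(phi)
    total = 0
    for size in range(delta, m + 1):
        for subset in combinations(range(m), size):
            inside = set(subset)
            term = 1
            for j in range(m):
                term *= phi[j] if j in inside else (1 - phi[j])
            total += term
    return total

def check_all(m, delta):
    """Compare the expansion with I(degree >= delta) for every 0/1 edge pattern."""
    return all(hub_indicator_expansion(phi, delta) == int(sum(phi) >= delta)
               for phi in product((0, 1), repeat=m))

ok = check_all(5, 2)
```

Only the subset equal to the actual neighbor set contributes a non-zero term, so the sum equals 1 exactly when the vertex has at least `delta` neighbors.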



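The following simple asymptotic representation will be useful in the sequel. For any
$i_1, \ldots, i_k \in \{1, \ldots, p\}$, $i_1 \neq \cdots \neq i_k \neq i$, $k \in \{1, \ldots, p-1\}$,

(35) $E\left[ \prod_{j=1}^{k} \phi_{i i_j} \right] = \int_{S_{n-2}} dv \int_{A(r,v)} du_1 \cdots \int_{A(r,v)} du_k \ f_{U_{i_1},\ldots,U_{i_k},U_i}(u_1, \ldots, u_k, v)$

(36) $\le P_0^k a_n^k M_{k|1}$,

with $P_0 = P_0(\rho, n)$ defined in (23), $a_n = |S_{n-2}|$, and

(37) $M_{k|1} = \max_{i_1 \neq \cdots \neq i_{k+1}} \left\| f_{Z_{i_1},\ldots,Z_{i_k} | Z_{i_{k+1}}} \right\|_\infty$.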
The following simple generalization of (36) to arbitrary product indices $\phi_{ij}$ will also be needed:

(38) $E\left[ \prod_{l=1}^{m} \phi_{i_l j_l} \right] \le P_0^m a_n^m M_{|Q|}$,

where $Q = \mathrm{unique}(\{i_l, j_l\}_{l=1}^{m})$ is the set of unique indices among the distinct pairs $\{(i_l, j_l)\}_{l=1}^{m}$
and $M_{|Q|}$ is a bound on the joint density of $Z_Q$.



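Define the random variable

(39) $\theta_i = \binom{p-1}{\delta}^{-1} \sum_{\vec{k} \in C_i(p-1,\delta)} \ \prod_{j=1}^{\delta} \phi_{ik_j}$.

We show below that for sufficiently large $p$

(40) $\left| E[\phi_i] - \binom{p-1}{\delta} E[\theta_i] \right| \le \gamma_{p,\delta} \left( (p-1) P_0 \right)^{\delta+1}$,

where $\gamma_{p,\delta} = \max_{\delta+1 \le l < p} \{ a_n^l M_{l|1} \} \left( e - \sum_{l=0}^{\delta} \frac{1}{l!} \right) \left( 1 + (\delta!)^{-1} \right)$ and $M_{l|1}$ is a least upper bound
on any $l$-dimensional joint density of the variables $\{Z_j\}_{j \neq i}$ conditioned on $Z_i$.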

To show inequality (40) take expectations of (34) and apply the bound (36) to obtain 

f p — r 



< 



(41) 
where 



P-i 

E 

=5+1 

<A(i+(5!n, 



p — i 

Z 



Po« Z ^|i + 



p-i 

E 

2=<5+l 



p — 1 

(5 



p — 1 



p-l-8 

E 

z=i 



p-1 - <S 



Z 



The line (41) follows from the identity ( P ~]~ S ) (Y) = (i+J)^ 1 )" 1 and a change of index in 
the second summation on the previous line. Since (p — 1)P < 1 

i((p-W 



|A| < max K7%} V 

~ Z=<5+1 



p — 1 
I 



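Application of the mean-value theorem to the integral representation (35) yields

(42) $\left| E[\theta_i] - P_0^\delta J\left( \overline{f_{Z_{*_1-i},\ldots,Z_{*_\delta-i},Z_i}} \right) \right| \le \tilde\gamma_{p,\delta} \left( (p-1) P_0 \right)^{\delta} r$,

where $\tilde\gamma_{p,\delta} = 2 a_n^{\delta+1} \dot{M}_{\delta+1|1} / \delta!$ and $\dot{M}_{\delta+1|1}$ is a bound on the norm of the gradient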
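Combining (40)-(42) and the relation $r = O((1-\rho)^{1/2})$,

(43) $\left| E[\phi_i] - \binom{p-1}{\delta} P_0^\delta J\left( \overline{f_{Z_{*_1-i},\ldots,Z_{*_\delta-i},Z_i}} \right) \right| \le O\left( ((p-1) P_0)^\delta \max\{ (p-1) P_0, (1-\rho)^{1/2} \} \right)$.

Summing over $i$ and recalling the definitions (14) and (15) of $\xi_{p,n,\delta,\rho}$ and $\eta_{p,\delta}$,

(44) $\left| E[N_{\delta,\rho}] - \xi_{p,n,\delta,\rho} J\left( \overline{f_{Z_{*_1},\ldots,Z_{*_{\delta+1}}}} \right) \right| \le O\left( p ((p-1) P_0)^\delta \max\{ (p-1) P_0, (1-\rho)^{1/2} \} \right) = O\left( \eta_{p,\delta}^{\delta} \max\{ \eta_{p,\delta} \, p^{-1/\delta}, (1-\rho)^{1/2} \} \right)$.

This establishes the bound (16).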

For the bound (17) we use the Chen-Stein method [2]. The part of the bound (17) that
holds for $\delta = 1$ was derived in the course of proof of Prop. 1 in [12]. Below we treat the case
$\delta > 1$. Recalling the definition of $\tilde N_{\delta,\rho}$ as the number of subsets of $\delta$ mutually coincident edges
in $\mathcal{G}_\rho$, we have the representation:

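(45) $\tilde N_{\delta,\rho} = \sum_{i_0=1}^{p} \ \sum_{i_1 < \cdots < i_\delta} \ \prod_{j=1}^{\delta} \phi_{i_0 i_j}$,

where the second sum is indexed over $i_1, \ldots, i_\delta \neq i_0$. For $\vec{i} = (i_0, i_1, \ldots, i_\delta)$ define the
index set

$B_{\vec{i}} \stackrel{\mathrm{def}}{=} B_{(i_0,i_1,\ldots,i_\delta)} = \{ (j_0, j_1, \ldots, j_\delta) : j_l \in \mathcal{N}_k(i_l) \cup \{i_l\}, \ l = 0, \ldots, \delta \} \cap C^<$,

where $C^< = \{ (j_0, \ldots, j_\delta) : j_0 \in \{1, \ldots, p\}, \ 1 \le j_1 < \cdots < j_\delta \le p, \ j_1, \ldots, j_\delta \neq j_0 \}$. These index
the distinct sets of points $Z_{\vec{i}} = \{ Z_{i_0}, Z_{i_1}, \ldots, Z_{i_\delta} \}$ and their respective $k$-NN's. Note that
$|B_{\vec{i}}| \le k^{\delta+1}$. Identifying $\tilde N_{\delta,\rho} = \sum_{\vec{i} \in C^<} \prod_{l=1}^{\delta} \phi_{i_0 i_l}$ and $N^*_{\delta,\rho}$ a Poisson distributed random
variable with rate $E[\tilde N_{\delta,\rho}]$, the Chen-Stein bound [2, Thm. 1] is

(46) $2 \max_A \left| P(\tilde N_{\delta,\rho} \in A) - P(N^*_{\delta,\rho} \in A) \right| \le b_1 + b_2 + b_3$,

where

$b_1 = \sum_{\vec{i} \in C^<} \ \sum_{\vec{j} \in B_{\vec{i}}} E\left[ \prod_{l=1}^{\delta} \phi_{i_0 i_l} \right] E\left[ \prod_{m=1}^{\delta} \phi_{j_0 j_m} \right]$,

$b_2 = \sum_{\vec{i} \in C^<} \ \sum_{\vec{j} \in B_{\vec{i}} - \{\vec{i}\}} E\left[ \prod_{l=1}^{\delta} \phi_{i_0 i_l} \prod_{m=1}^{\delta} \phi_{j_0 j_m} \right]$,

and, for $p_{\vec{i}} = E\left[ \prod_{l=1}^{\delta} \phi_{i_0 i_l} \right]$,

$b_3 = \sum_{\vec{i} \in C^<} E\left[ \left| E\left[ \prod_{l=1}^{\delta} \phi_{i_0 i_l} - p_{\vec{i}} \ \Big| \ Z_{A_k(\vec{i})} \right] \right| \right]$.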
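The counting representation (45) can be illustrated on a toy graph. The sketch below is an illustration under stated assumptions, not part of the proof: the function names and the 4-vertex star graph are hypothetical, and `adj` is a symmetric 0/1 adjacency matrix with zero diagonal. It evaluates the double sum directly, compares it with the equivalent closed form $\sum_i \binom{d_i}{\delta}$, and counts the hubs $N_{\delta,\rho}$ for comparison.

```python
from itertools import combinations
from math import comb

def count_star_subsets(adj, delta):
    """Direct evaluation of the double sum: centers i0 and delta-subsets of its neighbors."""
    p = len(adj)
    total = 0
    for i0 in range(p):
        others = [j for j in range(p) if j != i0]
        for subset in combinations(others, delta):
            # The product of edge indicators is 1 only if every edge (i0, j) is present
            if all(adj[i0][j] for j in subset):
                total += 1
    return total

def count_hubs(adj, delta):
    """Number of vertices whose degree is at least delta."""
    return sum(1 for row in adj if sum(row) >= delta)

# Hypothetical star graph: vertex 0 is connected to vertices 1, 2 and 3
adj = [[0, 1, 1, 1],
       [1, 0, 0, 0],
       [1, 0, 0, 0],
       [1, 0, 0, 0]]
n_tilde = count_star_subsets(adj, 2)
n_hubs = count_hubs(adj, 2)
closed_form = sum(comb(sum(row), 2) for row in adj)
```

The two counts differ in general, but one is positive exactly when the other is, which is what allows a Poisson approximation for the subset count to be used for the probability that at least one hub is discovered.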



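Over the range of indices in the sum of $b_1$, $E\left[ \prod_{l=1}^{\delta} \phi_{i_0 i_l} \right]$ is of order $O(P_0^\delta)$, by (38), and
therefore

$b_1 \le O\left( p^{\delta+1} k^{\delta+1} P_0^{2\delta} \right) = O\left( \eta_{p,\delta}^{2\delta} (k/p)^{\delta+1} \right)$,

which follows from definition (15). More care is needed to bound $b_2$ due to the symmetry
relation $\phi_{ij} = \phi_{ji}$. If in the summations defining $b_2$, $i_0 = j_m$ and $j_0 = i_l$ occur for some
$l, m$ then there will be a match and $\phi_{i_0 i_l} \phi_{j_0 j_m} = \phi_{i_0 i_l}$. In such case the summand of $b_2$ will
be of lower order than $O(P_0^{2\delta})$. For example, for the case that $l, m = 1$ a match implies
$\phi_{i_0 i_1} = \phi_{j_0 j_1}$ and, from (38),

$E\left[ \prod_{l=1}^{\delta} \phi_{i_0 i_l} \prod_{m=1}^{\delta} \phi_{j_0 j_m} \right] = E\left[ \prod_{l=1}^{\delta} \phi_{i_0 i_l} \prod_{m=2}^{\delta} \phi_{j_0 j_m} \right] = O\left( P_0^{2\delta-1} \right)$.

Over $C^<$ and $B_{\vec{i}} - \{\vec{i}\}$ there can be no more than a single match in $b_2$'s summand. For a
given match there are at most $p^{\delta+1} k^{\delta-1}$ summands of reduced order. We conclude that

$b_2 \le O\left( p^{\delta+1} k^{\delta+1} P_0^{2\delta} \right) + O\left( p^{\delta+1} k^{\delta-1} P_0^{2\delta-1} \right) = O\left( \eta_{p,\delta}^{2\delta} (k/p)^{\delta+1} \right) + O\left( \eta_{p,\delta}^{2\delta-1} (k/p)^{\delta-1} p^{-(\delta-1)/\delta} \right)$,

which follows from the relation $p^{2\delta} P_0^{2\delta-1} = \left( p^{\delta+1} P_0^{\delta} \right)^{2 - 1/\delta} / p^{(\delta-1)/\delta}$.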

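Next we bound the term $b_3$ in (46). The set $A_k(\vec{i}) = B_{\vec{i}}^c - \{\vec{i}\}$ indexes the complementary
$k$-NN's of $Z_{\vec{i}}$ so that, using the representation (38),

$b_3 = \sum_{\vec{i} \in C^<} E\left[ \left| E\left[ \prod_{l=1}^{\delta} \phi_{i_0 i_l} - p_{\vec{i}} \ \Big| \ Z_{A_k(\vec{i})} \right] \right| \right]$

$= \sum_{\vec{i} \in C^<} E\left[ \left| \int_{S_{n-2}} dz_{i_0} \prod_{j=1}^{\delta} \int_{A(r,z_{i_0})} dz_{i_j} \left( f_{Z_{\vec{i}} | Z_{A_k(\vec{i})}}(z_{\vec{i}} | Z_{A_k(\vec{i})}) - f_{Z_{\vec{i}}}(z_{\vec{i}}) \right) \right| \right]$

$\le O\left( p^{\delta+1} P_0^{\delta} \|\Delta_{p,n,k,\delta}\|_1 \right) = O\left( \eta_{p,\delta}^{\delta} \|\Delta_{p,n,k,\delta}\|_1 \right)$.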
Observe that, with A = P[A^ p ] 

|P(A^>0)-(l-exp(-A))| < \p(Ns, P >0)-P(Ns, p >0) 

+ |p(A\ p > 0) - (l - exp(-E[N S , p ]j) 
+ |exp(-P[A\ p ]) -exp(-A) 
(47) < 6i + 6 2 + 6 3 + 0f P[A>, 



A 



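Combining the above inequalities on $b_1$, $b_2$ and $b_3$ yields the first three terms in the argument
of the "max" on the right side of (17).

It remains to bound the term $|E[\tilde N_{\delta,\rho}] - \Lambda|$. Application of the mean value theorem to the
multiple integral (38) gives

(48) $\left| E\left[ \prod_{l=1}^{\delta} \phi_{i i_l} \right] - P_0^\delta J\left( f_{Z_{i_1},\ldots,Z_{i_\delta},Z_i} \right) \right| \le O\left( P_0^\delta r \right)$.

Applying relation (45),

(49) $\left| E[\tilde N_{\delta,\rho}] - p \binom{p-1}{\delta} P_0^\delta J\left( \overline{f_{Z_{*_1},\ldots,Z_{*_{\delta+1}}}} \right) \right| \le O\left( p^{\delta+1} P_0^\delta r \right) = O\left( \eta_{p,\delta}^{\delta} r \right)$.

Combine this with (47) to obtain the bound (17). This completes the proof of Prop. 1. □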
Proof of Cor. 1:

For correlation hub screening ($Z = U$) $\|\Delta_{p,n,k,\delta}\|_1 = 0$ so it suffices to consider the
other arguments of "max" in the bound (17). As in the proof of Prop. 2, $(1 - \rho_p)^{1/2} = O\left( p^{-(\delta+1)/((n-2)\delta)} \right)$
and we can merge the last two terms in (17) into the single term
$p^{-\alpha} = \max\{ p^{-1/\delta}, (1 - \rho_p)^{1/2} \}$, with $\alpha$ defined in the Corollary statement. Finally, note that
$\eta_{p,\delta} = O(1)$ and $(k/p)^{\delta+1} > Q_{p,k,\delta} = (k/p)^{\delta-1} p^{-(\delta-1)/\delta}$ when $k/p > p^{-(\delta-1)/(2\delta)}$. Therefore, as
$\alpha < (\delta-1)/(2\delta)$ when $n > 3$, we conclude that if $k/p > p^{-\alpha}$ all arguments of "max" in the
bound (17) are dominated by $(k/p)^{\delta+1}$. Turning to partial correlation hub screening ($Z = Y$),
under the block-sparse covariance assumption Prop. 3 asserts that $\|\Delta_{p,n,k,\delta}\|_1 = O(k/p)$,
which dominates $(k/p)^{\delta+1}$. This completes the proof of Cor. 1. □

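Proof of Proposition 3:

By block-sparsity, the matrix $U$ of Z-scores can be partitioned as $U = [\underline{U}, \overline{U}]$, where
$\underline{U} = [\underline{U}_1, \ldots, \underline{U}_q]$ and $\overline{U} = [\overline{U}_1, \ldots, \overline{U}_{p-q}]$ are the dependent and independent columns of $U$,
respectively. Since the columns of $\overline{U}$ are i.i.d. and uniform over the unit sphere $S_{n-2}$, as
$p \to \infty$ we have

$(p - q)^{-1} \overline{U} \, \overline{U}^T \to E[\overline{U}_i \overline{U}_i^T] = (n-1)^{-1} I_{n-1}$.

Furthermore, as the entries of the matrix $q^{-1} \underline{U} \, \underline{U}^T$ are bounded by 1,

$p^{-1} \underline{U} \, \underline{U}^T = O(q/p)$,

where $O(u)$ is an $(n-1) \times (n-1)$ matrix whose entries are of order $O(u)$. Hence, as
$U U^T = \underline{U} \, \underline{U}^T + \overline{U} \, \overline{U}^T$, the pseudo-inverse of $R$ has the asymptotic large $p$ representation

(50) $R^\dagger = \left( \frac{n-1}{p} \right)^2 U^T \left[ I_{n-1} + O(q/p) \right]^{-2} U = \left( \frac{n-1}{p} \right)^2 U^T U \left( 1 + O(q/p) \right)$,

which establishes (20).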
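The representation (50) rests on the identity $R^\dagger = U^T (U U^T)^{-2} U$ for $R = U^T U$, which can be verified numerically on a tiny example. The sketch below is a sanity check under stated assumptions: it takes $n - 1 = 2$ and $p = 4$, uses hypothetical angles to place the columns of $U$ on the unit circle, assumes $U U^T$ is invertible, and uses hand-written helpers (`matmul`, `transpose`, `inv2`, `max_abs_diff`) in place of a linear algebra library.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def max_abs_diff(A, B):
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))

# Hypothetical Z-scores: four unit vectors in the plane (columns of the 2 x 4 matrix U)
angles = [0.1, 0.9, 2.0, 2.8]
U = [[math.cos(t) for t in angles], [math.sin(t) for t in angles]]
Ut = transpose(U)

R = matmul(Ut, U)                      # 4 x 4 matrix of rank 2
G_inv = inv2(matmul(U, Ut))            # (U U^T)^{-1}
R_pinv = matmul(matmul(Ut, matmul(G_inv, G_inv)), U)   # U^T (U U^T)^{-2} U

# Two of the Moore-Penrose conditions
err1 = max_abs_diff(matmul(matmul(R, R_pinv), R), R)
err2 = max_abs_diff(matmul(matmul(R_pinv, R), R_pinv), R_pinv)
```

Because only the $(n-1) \times (n-1)$ matrix $U U^T$ has to be inverted, the cost of forming the pseudo-inverse is governed by $n$ rather than $p$, which is consistent with the low computational complexity claimed for $n \ll p$.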

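Define the partition $C = Q \cup Q^c$ of the index set $C = \{ (i_0, \ldots, i_\delta) : 1 \le i_0 \neq \cdots \neq i_\delta \le p \}$
where $Q = \{ (i_0, \ldots, i_\delta) : 1 \le i_l \le q, \ 1 \le l \le \delta \}$ is the set of $(\delta+1)$-tuples restricted to
the dependent columns $\underline{U}$ of $U$. The summation representations (27) and (29) of $J$ and
$\|\Delta_{p,n,k,\delta}\|_1$ yield

(51) $J\left( \overline{f_{Z_{*_1},\ldots,Z_{*_{\delta+1}}}} \right) = |C|^{-1} \left( \sum_{\vec{i} \in Q} + \sum_{\vec{i} \notin Q} \right) J\left( f_{Z_{i_0},\ldots,Z_{i_\delta}} \right)$

and

(52) $\|\Delta_{p,n,k,\delta}\|_1 = |C|^{-1} \left( \sum_{\vec{i} \in Q} + \sum_{\vec{i} \notin Q} \right) \Delta_{p,n,k,\delta}(\vec{i})$.

For correlation hub screening ($Z = U$) $\Delta_{p,n,k,\delta}(\vec{i}) = 0$ for all $\vec{i} \in C$ while, as the set
$\{ U_{i_0}, \ldots, U_{i_\delta} \}$'s are i.i.d. uniform for $\vec{i} \in Q^c$, $J(f_{Z_{i_0},\ldots,Z_{i_\delta}}) = 1$ for $\vec{i} \in Q^c$. As $J(f_{Z_{i_0},\ldots,Z_{i_\delta}})$ is
bounded and $|Q|/|C| = O\left( (q/p)^{\delta+1} \right)$ the relations (21) and (22) are established for the case
of correlation screening.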
For partial correlation hub screening ($Z = Y$), as $Y = [I_{n-1} + O(q/p)]^{-1} U$, the joint
densities of $Y$ and $U$ are related by $f_Y = (1 + O(q/p)) f_U$. Therefore, over the range $\vec{i} \notin Q$, the
$J$ and $\Delta_{p,n,k,\delta}$ summands in (51) and (52) are of order $1 + O(q/p)$ and $O(q/p)$, respectively,
which establishes (21) and (22) for partial correlation screening. This completes the proof
of Prop. 3. □